CAGEF_services_slide.png

Lecture 04: Data visualization with the ggplot2 library


0.1.0 About Introduction to R

Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style - it is 100% hands on! A few hours prior to each lecture, links to the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab Notebook with concepts, comments, instructions, and blank spaces that you will fill out with R code by coding along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into R when required. This learning approach will allow you to spend your time coding, not taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.

0.1.1 Where is this course headed?

We'll take a blank-slate approach to R here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:

data-science-explore.png

0.1.2 How do we get there? Step-by-step.

In the first two lessons, we will talk about the basic data structures and objects in R, get cozy with the RStudio environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), subset and merge data, and generate descriptive statistics. Next will be data cleaning and string manipulation; this is really the battleground of coding - getting your data into the format where you can analyse it. After that, we will make all sorts of plots for both data exploration and publication. Lastly, we will learn to write customized functions and apply more advanced statistical tests, which can really save you time and help scale up your analyses.

Draw_an_Owl-2.jpg

The structure of the class is a code-along style: It is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.


0.2.0 Class Objectives

This is the fourth in a series of seven lectures. Last lecture we finished up with basic manipulation of data frames with the help of the tidyr package. This week we are taking a break to enjoy the fruits of our labours. Now that we can make properly formatted data frames, we can use these objects as input to produce beautiful, publication-quality data visualizations with the help of the ggplot2 package. This week our topics are broken into:

  1. Introduction to ggplot and the grammar of graphics using scatterplots
  2. Exploring other types of plots
  3. Customizing your plots
  4. Saving your plots
  5. Taking your plots up a notch by combining them and using additional packages.

0.3.0 A legend for text format in Jupyter markdown

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green box: Recommended reads and resources to learn R

0.4.0 Lecture and data files used in this course

0.4.1 Weekly Lecture and skeleton files

Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2021-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.

Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.

0.4.2 Live-coding HTML page

A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!

0.4.3 Post-lecture PDFs and Recordings

As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus. A recorded version of the lecture will be made available through the University's MyMedia website and a link will be posted in the Discussion section of Quercus.


0.4.4 Dataset 1: Latrines

Sequencing of the V3-V5 hypervariable regions of the 16S rRNA gene

16S rRNA gene amplicon sequencing of 30 latrines from Tanzania and Vietnam at different depths (multiples of 20cm). Microbial abundance is represented in Operational Taxonomic Units (OTUs): groups of organisms defined by a specified level of DNA sequence similarity at a marker gene (e.g. 97% similarity at the V4 hypervariable region of the 16S rRNA gene). Intrinsic environmental factors such as pH, temperature, and organic matter composition were also recorded.

We have 3 CSV files.

0.4.4.1 File 1: data/metadata_pitlatrine.csv

A metadata file (Naming conventions: [Country_LatrineNo_Depth]) with sample names and environmental variables.

0.4.4.2 File 2: data/taxa_pitlatrine.csv

An OTU abundance table containing the numbers of operational taxonomic units sampled from various sites.

0.4.4.3 File 3: SPE_pitlatrine.csv

A wide-format data set of the pit latrine data containing the Taxa abundance information across various sample sites.

0.4.4.4 Data source

B Torondel, JHJ Ensink, O Gundogdu, UZ Ijaz, J Parkhill, F Abdelahi, V-A Nguyen, S Sudgen, W Gibson, AW Walker, and C Quince. Assessment of the influence of intrinsic environmental and geographical factors on the bacterial ecology of pit latrines. Microbial Biotechnology, 9(2):209-223, 2016. DOI:10.1111/1751-7915.12334


0.5.0 Packages used in this lesson

The following packages are used in this lesson:

Some of these packages should already be installed in your Anaconda base environment from previous lectures. If not, please review that lesson and install and load these packages. Remember to install these packages from the conda-forge channel of Anaconda.


1.0.0 Introduction to the Grammar of Graphics

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of a grammar can be summarized as follows:

ggplot2.png

The grammar of graphics is a language for communicating what we are plotting programmatically.


1.1.0 The grammar of graphics with ggplot2

The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, it can be easy to build a visualization. ggplot2 was made to interact well with tidy (long) datasets. If you are spending lots of time figuring out how to make a scatterplot, your data may not be in the correct format.

MajorComponentsGrammarGraphics.png The Major Components of the Grammar of Graphics by Dipanjan Sarkar

We can break down the above pyramid into its components, building from the base upwards.

  1. Data: your visualization always starts here. Which dimensions do you want to visualize? What aspect of your data are you trying to convey?
  2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.?
  3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.
  4. Geometric objects: how will you display your data within your visualization? Which geom_* will you use?
  5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.
  6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?
  7. Coordinate system: will your visualization follow a classic Cartesian, semi-log, polar, or other coordinate system?

Let's jump into our first dataset and start building some plots with it, shall we?


1.2.0 Scatterplot

Let's build our first plot step by step to learn more about how ggplot2 works. We will begin by loading datasets from our 16S rRNA V3-V5 hypervariable region sequencing.

Let's read our data tables in and also load ggplot2. We will be using a variety of packages today that are grouped in tidyverse, so let's load tidyverse. You can see in the package startup message that ggplot2 is one of the attached packages.

Specifying column types with read_csv(): In lecture 02 we allowed read_csv() to directly import our files and make an educated guess on what kind of data was held within. Above we have used the col_types argument and set it using a string of characters that is a shorthand representation for the data type of each column. We can use c (character), i (integer), d (double), l (logical), f (factor), and more! Just be sure you know the column types for all columns in your input!
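The read-in step referred to above might look like the following sketch. The exact col_types string depends on the columns in your file, so treat the one shown here as an assumption:

```r
# Load the tidyverse (attaches ggplot2, readr, dplyr, forcats, ...)
library(tidyverse)

# Read the metadata; the col_types string gives one letter per column
# (c = character, d = double). Adjust it to match your actual file!
metadata <- read_csv("data/metadata_pitlatrine.csv",
                     col_types = "cddddddddddd")
```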

1.2.1 Every ggplot object needs data

We're going to build this first plot layer by layer and that begins with specifying the data source. In this case, let's use metadata to start off our plot. When we see it print, you'll find that there's nothing much displayed as output.


1.2.2 Every ggplot object consists of many parameters

We have, however, created a ggplot object. This is a list of 9 parameters:

  1. data
  2. layers
  3. scales
  4. mapping
  5. theme
  6. coordinates
  7. facet
  8. plot environment
  9. labels

Luckily there are some defaults, so we don't have to specify everything, but you can start to see how ggplot objects are highly customizable. So far, we have only specified the data aspect of this object.


1.2.3 aes() determines attributes of the mapping list and how data is displayed

The next step is to choose the data we are plotting (aesthetics) and how it influences the visualization. At this point the data can be scaled directly and the axes appear. We have not yet specified how we want the data plotted, only which data should be plotted. In practice, people usually omit 'mapping = ', but it is a good reminder that mapping is, in fact, what we are doing.

When we start customizing our plot, our code starts to get a bit harder to read on one line. We can create each specification on a new line by ending each line with a +.

For our plot, we'll specify the x and y axis using data from the TS (total solids) and CODt (total chemical oxygen demand) variables.
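Assuming metadata was read in as above, the mapping step might look like this sketch. The axes appear, but no data is drawn yet because we have not added a geom:

```r
# Map TS to x and CODt to y; 'mapping =' is often omitted in practice
ggplot(data = metadata, mapping = aes(x = TS, y = CODt))
```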


1.2.4 Visualize our data as points on the graph with geom_point()

We now have to choose the geometric object (geom) with which to plot our data, in this case a point. A geom could be a line, a bar, a boxplot - you can type geom_ and then Tab to see all of the available options. Autocomplete can also be helpful for remembering syntax.

Some helpful geom commands:

Command | Geom description | Used for
geom_point() | Single points of data plotted on an x and y axis | scatterplots, dot plots, bubble charts
geom_bar() | Bar chart summarizing data, with bar heights proportional to the size of each group | bar plots and stacked bar plots
geom_col() | Bar chart where bar heights represent values in the data | bar plots of data values
geom_boxplot() | A rough visualization of the data distribution | boxplots
geom_line() | Tracks values of multiple groups along an x-axis such as time | line graphs
geom_jitter() | Spreads out overlapping data points | helpful for boxplots
geom_violin() | Combines a kernel density estimate with a boxplot-style format | violin plots

Since we are making a scatterplot we'll want to go with the geom_point() function.
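Adding the geom layer completes our first scatterplot; a minimal sketch:

```r
ggplot(metadata, aes(x = TS, y = CODt)) +
  geom_point()   # draw one point per row of metadata
```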


1.2.5 Specify colouring of groups through aes() based on factors

The data looks like there may be two groupings. My guess would be that this is for the 2 different countries. We can easily test this by colouring our points by Country. Look at the structure of metadata in either the Global Environment or using str(). To do this in R, we want to base our colouring on levels from a factor. Afterwards a legend will be automatically created for you.

To accomplish this, we first need to make sure that "Country" is a factor.

The aes() parameter of colour can be mapped to Country when we first specify 'x' and 'y' . By now you've noticed that we have been setting specific attributes in an order that matches our diagram of the grammar of graphics pyramid.
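Putting the two steps together - converting Country to a factor and mapping it to colour - might be sketched as follows:

```r
# Make sure Country is a factor so its levels drive the colouring
metadata <- metadata %>%
  mutate(Country = as.factor(Country))

ggplot(metadata, aes(x = TS, y = CODt, colour = Country)) +
  geom_point()   # a legend for Country is created automatically
```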


1.2.6 Most layers will, by default, inherit from aes() unless explicitly specified

When setting the aes() parameter there are generally three ways to do this, in order of inheritance:

  1. ggplot(data = ..., mapping = aes(x = ..., y = ..., colour = ...))
  2. ggplot(data = ..., aes(x = ..., y = ..., colour = ...))
  3. geom_*(aes(x = ..., y = ..., colour = ...))

This means that colour can be specified using geom_point(aes()) since it is a description of the points being plotted. When building a plot, this will supersede the plot's default mappings (if any were created and inherited). By placing version 2 into our code, at the beginning of our plot, we are essentially overriding the default mappings, which are nothing. I prefer to write the code this way for easier reading, but method 1 is the more formal way of setting a default mapping for your plot.

It is rare that you would use method 3, but not impossible - especially when layering multiple geom_*() objects, you may find that you want them coloured one way but shaped or sized based on a different factor. Setting the default mappings at the start saves you from having to enter this information in each new layer of your ggplot object.


1.2.7 Altering the scale of our axes with scale_y_*()

Some of our data points seem to be crushed near the x-axis. We can scale the y-axis to fix this. There are a number of ways to specify how the axes of our graph can be scaled. This is usually accomplished through the commands

scale_y_*() and scale_x_*() where * denotes a number of options in R including

Within these commands we can further specify the axis name, limits (start and end), breaks (tick mark locations), labels for each break, and axis (not data) transformations. In this case, let's keep it simple and log-transform our y-axis.
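A log-transformed y-axis can be added as its own layer, for example:

```r
ggplot(metadata, aes(x = TS, y = CODt, colour = Country)) +
  geom_point() +
  scale_y_log10()   # scales the axis; the underlying data are unchanged
```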


1.2.8 Alter the scale of axes directly in aes()

Keep in mind that scaling does not change the data, but rather the representation of the data. The y-axis has been scaled. This is different than taking the log10 of the y data.

Can we transform our data directly? Yes, by manipulating the data in our specification of the y-axis data itself in our aes() call but take a close look at the resulting y-axis!

The placement of the points looks the same, but the first graph is scaling the axis while the second graph has changed the data to a log10 scale.
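For comparison, transforming the data inside aes() might look like this sketch; note that the y-axis now shows log10 values rather than the original units:

```r
ggplot(metadata, aes(x = TS, y = log10(CODt), colour = Country)) +
  geom_point()   # here the data themselves are transformed
```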


1.2.9 Take advantage of facet_*() to display multiple conditions in separate panels

Faceting allows us to split our data into groups to display in a more separated fashion. Note that we have removed the colour specification in our groups as splitting the data into separate graphs accomplishes the same distinction.

It is good data visualization practice to only have one attribute (colour, shading, faceting, symbols) per grouping.
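Faceting by Country instead of colouring might be sketched as:

```r
ggplot(metadata, aes(x = TS, y = CODt)) +
  geom_point() +
  scale_y_log10() +
  facet_grid(~ Country)   # one panel per country, side by side
```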


1.2.10 Updating the image size in Jupyter through repr

You may have noticed that when your legends are quite large, you lose some real estate on your actual graph. This is due to how graphs are output, both when saving and when displaying. In Jupyter, the standard output dimension is a 7 x 7 inch graph.

When displaying your graphs in Jupyter you can update options through the repr library to widen or lengthen graphs as you create them with big legends or multiple facets. We'll talk about the process of saving them soon as well. First, let's fix our previous graph.
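The repr options can be set before plotting; the dimensions below are illustrative:

```r
library(repr)

# Widen the default 7 x 7 inch output to make room for legends/facets
options(repr.plot.width = 10, repr.plot.height = 5)
```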


1.2.11 Plots can be coloured using continuous variables

I could now add information from another variable as a colour in this plot. Note that if a variable is continuous instead of discrete, the colour will be a gradient.


1.2.12 Discriminate your variables by shape in the aes() parameters

If we really wanted to we could add another variable to the plot by changing the shape attribute. Let's change Country to having 2 shapes and facet by the Depth of the latrine instead.

Note that shape can only be used for discrete values.

A quick reference key for shapes can be found in the 'Cookbook for R' (http://www.cookbook-r.com/Graphs/Shapes_and_line_types/).


1.2.13 There are numerous ways to specify facet_*()

Note that up until now we've been using facet_grid(~variable) to split our data by variable. This annotation causes the grids to be distributed horizontally. Other ways to facet by a single variable are:

We can now note that only Tanzania had sampling greater than 4cm of depth. There are single latrines for 4 samples. There was no latrine at a depth of 11cm. Lack of replication and a bias towards Tanzania for the higher depths is something we should keep in mind while looking at this data. Depending on the question we are trying to answer, we may want to remove some of these data.

One thing that is not necessary in this case - but good to know about - is the ability to allow each grid to have its own independent axis scale. In our example, wells of Depths from 1-4cm have up to 1000 CODt, while the other wells barely have values past 100 CODt. This can be changed, but keep in mind most people will assume all grids have the same scale, so take extra care to point out that the scales are different when presenting or publishing.


1.2.13.1 What issue do we encounter when faceting by "Depth"?

Have you noticed something strange about the past two versions of our graph? When faceting by "Depth", how are our groups being ordered? Why?


1.3.0 Add regression lines using statistical transformations

You can also add statistical transformations to your plots. Again, type stat_ and then Tab to see the list of options. In this case let's separately fit a linear regression line to CODt vs TS for each country. The grey area around the line is the confidence interval (default = 0.95) and can be removed by adding se = FALSE to the stat_smooth() call.
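A per-country linear fit might be sketched as follows; the colour mapping is what splits the regression by Country:

```r
ggplot(metadata, aes(x = TS, y = CODt, colour = Country)) +
  geom_point() +
  stat_smooth(method = "lm")   # add se = FALSE to drop the confidence band
```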


1.3.1 Play around with regression models and use the alpha parameter to de-emphasize data

A linear model is not always the best fit. The method of calculating the smoothing function can be changed to other provided functions (such as loess - short for local regression, used below) or can be a custom formula. We'll talk more about making our own models in Lecture 06! Note that I changed the confidence interval by modifying level=0.8.

geoms can be made more transparent with the alpha parameter, which is set to 0.3 in the following code so that the emphasis is on the regression line rather than the points.
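Swapping in a loess fit and fading the points might look like:

```r
ggplot(metadata, aes(x = TS, y = CODt, colour = Country)) +
  geom_point(alpha = 0.3) +                    # de-emphasize the raw points
  stat_smooth(method = "loess", level = 0.8)   # local regression, 80% CI
```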


2.0.0 Exploring different types of plots

Now that we have some of the basics, it's time to take a closer look at using other types of plots. We'll begin by reviewing the latrines dataset after loading it into memory as the variable latrines.


2.1.0 View the theoretical distribution of your data with density plots

What is the distribution of OTUs by sample site in each country? We can quickly visualize this property by making a density plot for OTUs by Country. I have set alpha (transparency) to 0.3 so that we can see both countries on our plot.
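Assuming the long-format latrines data frame with Country and OTUs columns, the density plot might be sketched as:

```r
ggplot(latrines, aes(x = OTUs, fill = Country)) +
  geom_density(alpha = 0.3)   # transparency lets both countries show through
```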


2.1.1 Set the limits of your axes with xlim()

The first thing to notice is that everything is clumped at 0. This is because we have not filtered our data frame to remove all observations where OTUs are zero. The other thing to notice is that there is a long tail containing only a few observations. It will be necessary to change the x-axis to see our data. This is done by setting 'limits' (lower and upper boundaries) on the x-axis with xlim(). Note that R gives us a warning that we are not viewing 158 of our 4212 rows.

Let's do the following:

  1. Filter our data to have at least 2 OTUs.
  2. Set upper and lower boundaries with xlim().
  3. Add a geom_rug so that we can see where each value falls along the distribution.
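The three steps above might be combined like this; the xlim() upper bound of 200 is an arbitrary illustration value:

```r
filtered_latrines <- latrines %>%
  filter(OTUs >= 2)   # keep observations with at least 2 OTUs

ggplot(filtered_latrines, aes(x = OTUs, fill = Country)) +
  geom_density(alpha = 0.3) +
  xlim(0, 200) +   # zoom in; out-of-range rows trigger a warning
  geom_rug()       # tick marks showing where each value falls
```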

2.2.0 Histograms can similarly visualize our distribution

Unlike density plots, histograms count the number of observations you have in each 'bin' that you specify. So with proper parameters you can recreate a similar shape to your density plots using only the observed data.

Binwidth: The geom_histogram() function defaults to 30 bins, which means your data will be divided into 30 equal intervals along your x-axis. The geom itself is agnostic to your data, its values, or the meaning (units) of those values. This is simply a default behaviour and you should set the binwidth (or number of bins) yourself - R will even warn you to change your binwidth.

2.2.1 Use the position parameter to alter how data is stacked

Instead of having the countries' information stacked, we may want to see the data side by side. This can be done by setting the position parameter to "dodge". Let's do the following:

  1. Set a larger bin width
  2. Dodge our data
  3. Add a y-axis limit with ylim()
  4. Add a geom_rug()
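The four steps might be sketched as below; the binwidth and ylim values are illustrative:

```r
ggplot(filtered_latrines, aes(x = OTUs, fill = Country)) +
  geom_histogram(binwidth = 20, position = "dodge") +  # side-by-side bars
  ylim(0, 100) +   # illustrative y-axis limit
  geom_rug()
```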

2.2.2 Barplots stack our categorical data for us.

Can we create a bar plot of OTUs per country? With geom_bar() and the proper aes() we can fill in colour along the bar to represent specific Taxa.

The default use of geom_bar() is to create a bar chart where the height of each bar is proportional to the number of observations (i.e. rows in Taxa) for a particular group (i.e. Country). The default argument for this calculation in geom_bar() is stat = "count". We can, however, use the sum of the values of a variable by setting stat = "identity" instead, but then a y variable must be identified.
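The two variants might be sketched as:

```r
# Default: bar height = number of rows (observations) per Country
ggplot(latrines, aes(x = Country, fill = Taxa)) +
  geom_bar()

# Sum of a value instead: stat = "identity" requires a y variable
ggplot(latrines, aes(x = Country, y = OTUs, fill = Taxa)) +
  geom_bar(stat = "identity")
```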


2.2.3 Control the order of categorical variables in your legend with fct_reorder2()

In the legend for the bar plot our factor levels for Taxa are in alphabetical order. However, an ordering that might be more useful would be an order that matches our data, for example, Taxa in descending order of OTU abundance.

To do this, we can use the forcats package (included in tidyverse and already loaded). fct_reorder2() takes the factor we want ordered (Taxa) and associates the highest x values (Country) with their y values (OTUs), ordering them in descending order. This is most relevant when working with line graphs but works for our bar graphs too.

We will talk about customizing plots later in this lesson, but the last line of code is one way to remove the legend title (by making the legend.title 'blank').
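A sketch of the reordering, assuming the latrines columns used above:

```r
ggplot(latrines,
       aes(x = Country, y = OTUs,
           fill = fct_reorder2(Taxa, Country, OTUs))) +  # reorder the legend
  geom_bar(stat = "identity") +
  theme(legend.title = element_blank())   # remove the legend title
```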

Now our highest abundance Taxa is the first value in our legend, and this matches the order of the Taxa in our bar graph. The legend order now has meaning.


2.2.4 You can dodge your barplots too!

You can alternate between 'stacked' or 'dodged' (as we did with the histogram) for whether your bars are on top or next to each other when splitting by a factor or categorical variable.


2.2.5 Flip your data to display horizontally with coord_flip()

Our data looks quite squished when displaying the bars vertically. You can have your bars horizontal instead of vertical by using the coord_flip() layer.


2.3.0 Boxplots provide visual summary statistics of your data

Boxplots are a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles). Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.

We are going to use boxplots to see the distribution of OTUs per Taxa across all samples.

2.3.1 Rotate axis text by updating theme() and angle

So we can immediately see there are some issues with the plot. Text along the x-axis is overlapping and illegible. Let's fix the text on the x-axis by rotating it 90 degrees. To accomplish this we will use the theme() layer.


2.3.2 Justify the axis text through theme() and hjust

We've updated the angle of our text, but the labels are positioned with a somewhat "centred" alignment. We can justify the labels so that they align with the x-axis.

You may ask why this is a horizontal (hjust) justification, when it seems like moving the labels upwards towards the x-axis should be a vertical (vjust) justification. If you look in the help menu at element_text() you will see that the justification is carried out before the rotation. While we can specify the parameters of element_text() in any order, this does not change the order they are executed in the function.
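Combining the rotation and justification might look like this sketch:

```r
ggplot(latrines, aes(x = Taxa, y = OTUs)) +
  geom_boxplot() +
  # hjust is applied before the 90-degree rotation,
  # pulling the labels up against the x-axis
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```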


While it is clear that Clostridia is the most represented taxa, it is difficult to tell whether some other taxa have no representation, or if they simply have low representation. Transforming to a log scale on the y-axis may clarify this question for us.

Do you know why we are getting the warning: Transformation introduced infinite values in continuous y-axis? What if we were working with filtered_latrines instead?

Beware log transformations: Remember if you do transform your data values, always remember to consider the range of your values to ensure that you get a proper result. Here we see that ggplot automatically removed some values for us.

2.3.3 Slice your data to display what you'd like

For now, let's just keep the Taxa that are common to both countries and have greater than 1 OTU counts. To do this we can revisit our favourite dplyr verbs:

  1. filter() for rows with greater than or equal to 2 OTUs (already completed in 2.2.0 with filtered_latrines)
  2. group_by() Country and Taxa to get Country-specific information about Taxa
  3. count the number of entries, filter() Taxa that are shown to appear in more than one sample.
  4. We then need to ungroup() to select our final list or we will end up with a grouped object.
  5. We use the duplicated() function to retain those Taxa that were found twice (ie. in both countries).

The %in% operator returns a logical vector indicating, for each element of its first argument, whether there is a match in its second argument. It is useful for subsetting.
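The five steps plus %in% might be sketched as follows; the intermediate variable names are assumptions consistent with the text:

```r
# Taxa (with >= 2 OTUs) that appear in more than one sample, per country
common_taxa <- filtered_latrines %>%
  group_by(Country, Taxa) %>%
  summarise(n = n()) %>%   # count entries per Country/Taxa combination
  filter(n > 1) %>%
  ungroup() %>%
  pull(Taxa)

# duplicated() keeps Taxa found twice, i.e. present in both countries
shared_taxa <- common_taxa[duplicated(common_taxa)]

shared_latrines <- filtered_latrines %>%
  filter(Taxa %in% shared_taxa)   # %in% gives a logical vector for subsetting
```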

We can facet our plot by Country to verify we have only commonly appearing OTUs.

We will be using this graph as a base for customization later in the lesson.


2.4.0 Beeswarm Plots show us all of our data points

Even though boxplots give us summary statistics on our data, it is useful to be able to see where our individual data points are. We've already used geom_rug() to help visualize our data distribution in density plots.

Similarly, for a boxplot we can add the data as another layer using geom_point() to place dots on top of our boxplot, or use geom_jitter() to spread our points out a bit. However, a beeswarm plot places data points that are overlapping next to each other, so we can get a better picture of the distribution of our data. We'll start off by looking at the geom_beeswarm() function from the ggbeeswarm package.

I have subset the data to make a boxplot with 3 Taxa and 1 Country so that we can see the differences in geom options a bit better.

Filter out those 0 values! Remember how I just warned you about log transformations? The ggbeeswarm package has some issues with -Inf values, so we'll have to filter them out when we subset our data!

2.4.1 Store ggplot objects as variables that you can continue to update

As you can see above, an option with ggplot2 is to save your plot into a ggplot object. This works well if you know you are only changing one or two elements of your plot, and you do not want to keep retyping code. What we are going to vary here is how the data points are displayed.

Now, we can simply overlay the points with geom_beeswarm(). Notice that this geom comes from the ggbeeswarm package and is not a part of ggplot2 itself. However, it was built to work with ggplot2 objects!
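Storing the base plot and overlaying the beeswarm might look like this sketch; subset_latrines stands in for the 3-Taxa, 1-Country subset described above (zeros already filtered out):

```r
library(ggbeeswarm)

# Base boxplot saved as an object we can keep adding layers to
box_plot <- ggplot(subset_latrines, aes(x = Taxa, y = log10(OTUs))) +
  geom_boxplot()

box_plot + geom_beeswarm()   # overlapping points are placed side by side
```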


2.4.2 cex is a common parameter used to adjust plotting properties of text and symbols

Depending on the function or geom you may often find that the cex parameter can be adjusted to alter some aspect of how a geom or other graphical layer is displayed. In the case of geom_beeswarm() we can increase the spacing between data points to make its distribution a bit clearer.


2.4.3 Reduce overplotting with geom_quasirandom()

If you think you will have many points to display, or if you want to avoid adjusting parameters with each new plot, consider using geom_quasirandom() to give the empirical distribution of the stripplot and avoid overplotting. It is a geom included with the ggbeeswarm package and can simplify the look and creation of your plots.

Other spacing and distribution options are available at https://github.com/eclarke/ggbeeswarm.

quasirandom_vs_simple_beeswarm.jpg


3.0.0 Customizing your plots

Up to this point, we've discussed some of the ways to alter our plots to best visualize our data by dabbling with colour, fill, shape, and position. This has been accomplished mostly through the aes() attribute. Depending on the nature of the layers or elements you add, we can alter their characteristics individually to customize.

Plot elements relating to your data include things like axis labels, titles, colour or shapes that represent subsets of your data, scaling that is data-dependent, legends, and other data-driven parameters.

For customizing your data it is possible to change:

Titles and axis labels can be added using:

We have seen in the above examples that colour can be applied to discrete or continuous variables. We can also use colour (shape, etc.) to represent outliers. In this dataset, outliers beyond the whiskers (above or below 1.5*IQR) can be coloured red.

Let's review our base boxplot of shared taxa between countries from section 2.3.3. We'll quickly update it to include some colour:


3.1.0 Adding a title and axis labels

Let's start off by sprucing up our plot with

I also want to add a title, modify the y-axis label to say that the data is log, and remove the x-axis label. Note that I remove the x-axis label by using 'NULL'.


3.1.1 Use the labs() command to control multiple labels

Using individual commands to alter the x- and y-axis titles and the title of your plot gives you control over aspects of each individual element like font, size, and colour. If you want them all to have a uniform aesthetic, you can simply use the labs() command. This can include legend titles too!
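For example, all labels in one call (the plot and label text are illustrative):

```r
ggplot(latrines, aes(x = Taxa, y = OTUs, fill = Country)) +
  geom_boxplot() +
  labs(title = "OTU abundance by taxon",   # plot title
       x = NULL,                           # remove the x-axis label
       y = "OTUs",
       fill = "Country")                   # legend title
```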


3.1.2 Assigning or altering labels on your plot

I also want to capitalize the Country labels. This can be done in a couple of ways.

One way would be to change the values in the dataset using string manipulation. A second way would be to use the labeller() function. I can make a vector of the capitalized names to replace 'Tanzania' and 'Vietnam'. The data is split by Country in facet_grid(), and this is where we pass our labels to labeller(), which will output the names on the strip label.

I am now going to save this plot in a ggplot object, since we are going to use this as our base plot for the next section. I am also going to rotate the x-axis text again. We will talk about theme elements soon, but in the meantime it won't drive me bonkers.


3.2.0 Colour palettes!

colour_visualizations.jpg

A common thing to want to do is to change colours from ggplot2's default rainbow palette. There are many reasons to change a colour palette including

Let's create our own colour palette for Taxa.

3.2.1 A Note on Colour Palettes

There are 3 main types of colour palettes.

3.2.1.1 Use sequential colour palettes to display low to high values

Sequential - implies an order to your data - i.e. light to dark implies low values to high values, e.g. heatmaps.


3.2.1.2 Use diverging colour palettes to highlight the middle and extremes of a distribution

Diverging - low and high values are extremes, and the middle values are important - still goes from light to dark, but 3 colours mainly used.


3.2.1.3 Use qualitative colour palettes for categorical data

Qualitative - there is no quantitative relationship between colours. This is usually used for categorical data.

3.2.2 Add a colour palette to a plot like a layer

Let's test one of the RColorBrewer palettes out on our data. We'll add it as a layer to my_plot using scale_fill_brewer() to override the fill mappings defined in the aes() layer of the plot.
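A sketch of the layer we are adding, shown here with the built-in mpg data (our real plot object my_plot carries the fill mapping already):

```r
library(ggplot2)

# scale_fill_brewer() overrides the default fill colours with a
# named RColorBrewer palette.
ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  scale_fill_brewer(palette = "Spectral")
```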


3.2.2.1 Colour palettes are not vector recycled when plotting in ggplot

Notice the warning we received: "n too large..."? We have 24 different taxa along our x-axis, but the Spectral palette only has 11 colours. Unlike the vector recycling we saw in previous lectures, recycling does not occur when supplying a colour palette to our plot. In generating our plot, only the first 11 taxa receive colours.


3.2.3 RColorBrewer colour palettes can be created with brewer.pal()

Many colour palettes now exist. I'll showcase a couple that work nicely with ggplot2. These packages also have colorblind friendly options. RColorBrewer has options for these 3 types of palettes, which you can see with display.brewer.all(). With a smaller dataset, we could make a call in ggplot directly to scale_fill_brewer(), which just requires choosing one of RColorBrewer's palettes, such as "Spectral". However, we have 24 Taxa and these palettes have 8-12 colours, so we have to get creative.

Using the brewer.pal() function, we can pull different colours from palettes of our choosing. In our case, I have simply taken the 2 qualitative palettes that each have a length of 12, put them into one palette, and made sure my values were unique. This can then be passed to ggplot via scale_fill_manual().
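The palette-building step can be sketched as follows; my_plot is the ggplot object saved earlier:

```r
library(RColorBrewer)

# Combine the two 12-colour qualitative palettes, dropping any
# duplicated hex codes, for up to 24 distinct colours.
big_palette <- unique(c(brewer.pal(12, "Paired"), brewer.pal(12, "Set3")))
length(big_palette)   # up to 24 hex codes

# Then pass the vector to the fill scale:
# my_plot + scale_fill_manual(values = big_palette)
```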


Looks like we can use the Paired and Set3 palettes since they both have 12 colours that seem distinct enough. There may be some close colours though.


Looks like we have enough colours to satisfy our needs. Notice that these are coded using a hexadecimal system? Let's provide this vector as input.


3.2.4 You can always pick your own colours

You can always choose a vector of your own colours using this 'R color cheatsheet'.

Hexadecimal colours: The RGB colour scheme is represented by 3 colour values (Red, Green and Blue) using a colour scale between 0-255 for each. This blending of shades produces the colours we see and can be represented by a Hexadecimal value 000000-FFFFFF. Use an RGB colourpicker if you are obsessed with picking your very own colour palette.

If you just want a repeating pattern of colours, you can use the rep() command to help you out too!
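For example (a base-R sketch; the colour names are arbitrary):

```r
# Repeat a short colour vector to cover 24 categories.
my_colours <- rep(c("firebrick", "steelblue", "goldenrod"), times = 8)
length(my_colours)   # 24

# Then: my_plot + scale_fill_manual(values = my_colours)
```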


3.2.5 Colour blind accessible palettes can be found in the viridis package

Sometimes you may wish to work with a colour palette that best represents a continuous series of diverging values. In this case you may also want to ensure your colour palette avoids issues for readers that are printing in greyscale or those that may be colour blind. The viridis package contains some colour blind accessible palettes that can also help to really differentiate between the extremes of your spectrum.

The viridis package also has some nice colour options (https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html). While these are all sequential palettes (qualitative is best for our Taxa), we will showcase a couple here.
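In recent versions of ggplot2, the viridis scales are built in (scale_fill_viridis_d() for discrete data, scale_fill_viridis_c() for continuous); this sketch uses the built-in mpg data:

```r
library(ggplot2)

# The `option` argument selects the palette (viridis, magma, plasma, ...).
ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar() +
  scale_fill_viridis_d(option = "magma")
```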

RSkittleBrewer is another option for funky colour palettes. ggsci has a variety of colour palettes inspired by different scientific journals as well as television shows (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html).


3.3.0 Theme: attributes unrelated to your data

As mentioned earlier, it is possible to customize every single aspect of a ggplot. Most of this occurs with a call to theme(), which you can think of as modifying everything BUT your data. For example, my axis labels can be modified, but they (hopefully) have something to do with my data. However, changing the size of the text or the font of the labels is unrelated to my data, and the same structure (text font & size) could be carried over to other plots if I saved my own theme.

Things that you can change with theme() include the axis, legend, panels, gridlines, or background.

Each element of a theme inherits from one of:

ggplot2 comes with some themes - I suggest starting with the one that is close to what you want, and start modifying from there.

Check out these themes:

You can look at the default for each theme simply by typing it into the console.

And this is what theme_bw() practically looks like:


3.3.1 Remember that attribute changes are overridden by order of appearance

Notice how that last call to theme overrides my previous call to theme if there is a conflict. In this case the angle of the x-axis text is returned from a vertical to a horizontal orientation.

Here is an example of theme_dark(). I am going to override the default x-axis text angle of this theme by modifying it AFTER I call theme_dark().
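Sketched with the built-in mpg data, the order looks like this:

```r
library(ggplot2)

# theme_dark() first; the later theme() call wins any conflicts,
# so the x-axis text ends up rotated.
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  theme_dark() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
```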


3.3.2 More themes are found in the ggthemes package

ggthemes is a package of themes. Some of these themes are based on graphs seen in print or on websites (The Economist, Wall Street Journal, FiveThirtyEight) or match standard tools (Excel, Google Docs).

See more themes: Information about the ggtheme options can be found at the Github homepage.

Here are 2 possible themes.


3.4.0 Make a customized theme

You can also make your own custom theme as demonstrated here: http://joeystanley.com/blog/custom-themes-in-ggplot2

I am going to show you how to customize a plot, starting from theme_minimal() because I don't like the grey backgrounds or harsh axis lines.


3.4.1 Fix your plot elements with the theme() layer

Depending on the layout of your plot you can institute changes to the theme as you build your plot or afterwards. Just remember, each call to theme() will override any previous calls that conflict, so the order of changes is important. Many arguments to theme() represent major element categories, but there can be arguments that specifically represent sub-categories or sub-elements.

Things I don't like about this plot and their solutions:

| Problem | Solution | Layer / Command |
| --- | --- | --- |
| x-axis labels overlap and are small | rotate labels | axis.text.x |
| country labels are smaller than axis labels | change size and face | strip.text.x |
| title is uncentered | adjust position horizontally | plot.title |
| need a border to separate countries | create a border around each panel | panel.border |
| y-axis ticks are missing | add y-axis ticks | axis.ticks.y |
Theme layers are like onions: No, not smelly. There are just a lot of them. It isn't necessary to remember all of this syntax! It's certainly helpful but you can just bookmark the ggplot2 theme reference page instead.

As mentioned, the last call to theme() will override previous calls that conflict. Therefore, if we want to start with theme_minimal() as our base, it has to come in our code BEFORE the other modifications.
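Putting the fixes from the table above into code — a sketch with the built-in mpg data, where the facet variable stands in for Country:

```r
library(ggplot2)

ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot() +
  facet_grid(. ~ drv) +
  ggtitle("Highway mileage by class") +
  theme_minimal() +                                           # base theme FIRST
  theme(
    axis.text.x  = element_text(angle = 45, hjust = 1),       # rotate labels
    strip.text.x = element_text(size = 12, face = "bold"),    # facet strips
    plot.title   = element_text(hjust = 0.5),                 # centre title
    panel.border = element_rect(colour = "black", fill = NA), # panel border
    axis.ticks.y = element_line()                             # y-axis ticks
  )
```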


3.4.2 Challenge: Alter plot background and major/minor gridlines

Alter my_plot by:

  1. Changing the background colour.
  2. Adding minor gridlines.
  3. Making the major gridlines black.

There are a lot of ways to customize your plots! Keep exploring and playing with parameters!

3.4.3 Save your personalized themes to a variable

You may be wondering: can I save this awesome theme to apply to all my amazing plots? Yes! There are a number of ways to import your themes into other scripts once you learn to save your data objects to file in Lecture 07. For now, you can assign your themes to a variable and apply them to plots like any other layer.

Work smarter not harder: A key advantage of saving your theme to a variable is that, once saved, you can apply it easily to all of your plots, and you can also update and tweak your theme in a single place within your code or notebook, rather than across multiple code cells!
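A sketch of the pattern:

```r
library(ggplot2)

# A saved theme is just another object; add it like any layer.
my_theme <- theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  ggtitle("Engine size vs. mileage") +
  my_theme
```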

4.0.0 Saving your figures

Up until now, we have taken for granted that our plots have been displayed using a Graphic Device. For our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display but sometimes you may be producing multiple visualizations based on large data sets. In this case it is preferable to save them directly to file.

4.1.0 Graphics Devices

(https://rdpeng.github.io/Biostat776/notes/pdf/grdevices.pdf)

ggplot2 has its own function for saving its graphics: ggsave(). This allows us to skip the step of explicitly calling separate graphics devices and shutting them down afterwards (if you have saved plots in base R or lattice, this will sound familiar to you).

You can send the plot object to the screen device to preview your image, and then save that image by specifying the file device. If you do not specify the device type, ggsave() will guess it from your filename extension (pdf, jpeg, tiff, bmp, svg or png). Note that this will save whatever graphic was last on your screen device.

With ggsave() you can minimally input the filename you would like to have, and the path to your file.


However, in some cases you want to tailor your output. You can specify the width, height and units of your image, or you can apply a scaling factor (the 'eyeballing' approach). You can also specify the plot object you want to save instead of whatever was on your graphics device last using the 'plot' parameter. Note that this time I have combined the path with the filename, and called the file device type separately.
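Both styles of call can be sketched like this; the filenames and the figures/ directory are hypothetical:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Minimal call: the device is guessed from the file extension.
ggsave("my_figure.png", plot = p)

# Tailored call: path + filename combined, device named separately.
# (Assumes the figures/ directory already exists.)
ggsave("figures/my_figure.pdf", plot = p, device = "pdf",
       width = 18, height = 12, units = "cm")
```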

No image is sent to the screen device when a file is saved in this manner.


5.0.0 Taking it up a notch with the ggpub package

There are many fantastic R packages to analyze and visualize your data. As a group, we are likely working in a variety of specialized areas. The plots we have made so far today should be useful for data exploration for many different kinds of data. In the next section we are going to preview some more complex visualization types, but since these take more time to go through and not everyone may be interested in interactive graphics, network diagrams, time-series analysis, or geospatial data, we will not be plotting all of these together. We will, however learn how to arrange multiple plots per page, and also how to make an upset plot.

5.1.0 Multiple plots on one page (ie. for publication images) with ggarrange()

There are a variety of methods to mix multiple graphs on the same page; however, ggplot2 does not work well with all of them. I am going to work with a package called ggpubr, which allows us to align the axes of our plots. This package relies on gridExtra (which allows us to arrange plots) and works well with ggplot2.

For a demonstration, we are going to take 3 plots that we made earlier (a boxplot, a histogram, and a dot plot), save them as objects, and then arrange and align them in the same figure. (http://www.sthda.com/english/rpkgs/ggpubr/)

ggarrange() is a function from ggpubr that takes your plots, their labels, and how you would like your plots arranged in rows and columns. It takes the form of:

ggarrange(
  ...,
  plotlist = NULL,
  ncol = NULL,
  nrow = NULL,
  labels = NULL,
  label.x = 0,
  label.y = 1,
  hjust = -0.5,
  vjust = 1.5,
  font.label = list(size = 14, color = "black", face = "bold", family = NULL),
  align = c("none", "h", "v", "hv"),
  widths = 1,
  heights = 1,
  legend = NULL,
  common.legend = FALSE,
  legend.grob = NULL
)

Of the parameters some relevant ones for us are:

  1. ... - the plots to be arranged into a grid; alternatively, use plotlist
  2. plotlist - an optional list of plots to display
  3. labels - an optional list of labels for each plot
  4. ncol - the number of columns in the plot grid (optional)
  5. nrow - the number of rows in the plot grid (optional)

Some examples of simple grid arrangements are :

ggarrange_examples.png

To start, we want our boxplot and dot plot side by side. If you picture each plot as a square in a grid, we need two columns (one for each plot, ncol = 2) and one row (since they are side by side, nrow = 1).
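A sketch using the built-in mpg data for the two saved plot objects:

```r
library(ggplot2)
library(ggpubr)

box_plot <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()
dot_plot <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Two columns, one row: side by side, labelled A and B.
ggarrange(box_plot, dot_plot, labels = c("A", "B"), ncol = 2, nrow = 1)
```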


5.1.1 Move plot elements around through the theme() layer

While the grid areas are of the same size, the backgrounds are not. Let's adjust the legend of our histogram so that it is in the top right corner of the plot, and remove the white background.

Things to note:

  1. The legend.position parameter takes a vector of two values, each between 0 and 1: c(0, 0) is the lower-left and c(1, 1) is the upper-right of the graph. You can also specify "left", "right", "top", and "bottom" for positions outside your graph.
  2. The legend.background parameter, like others, can be set to an element such as element_rect(), but these elements can also be removed using the placeholder element_blank().
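Both notes can be sketched as theme() arguments on a stand-in histogram (legend.justification is an extra argument, not discussed above, that anchors the legend by its own corner):

```r
library(ggplot2)

ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_histogram(bins = 20) +
  theme(
    legend.position      = c(1, 1),        # upper-right of the panel
    legend.justification = c(1, 1),        # anchor legend by its own corner
    legend.background    = element_blank() # drop the white box behind it
  )
```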

5.1.2 Arrange plots within plots

Next we will add in the boxplot by nesting a ggarrange() call within another.

Imagine a square with 4 boxes.

  1. We are going to put our boxplot in the left-hand side across the top and bottom boxes
  2. The histogram will be placed in the top right box
  3. The bubble plot goes in the bottom right box.

To do this, we arrange 2 columns (one with the boxplot and one with the [histogram + bubble plot], ncol = 2), and within the right-hand column we arrange 2 rows (one with the histogram and one with the bubble plot, nrow = 2).
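The nesting can be sketched as follows (plot objects built from the built-in mpg data for illustration):

```r
library(ggplot2)
library(ggpubr)

box_plot    <- ggplot(mpg, aes(x = class, y = hwy)) + geom_boxplot()
histogram   <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)
bubble_plot <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) + geom_point()

# Outer grid: 2 columns. The right-hand column is itself a ggarrange()
# with 2 rows (histogram on top, bubble plot below).
ggarrange(box_plot,
          ggarrange(histogram, bubble_plot, nrow = 2, labels = c("B", "C")),
          ncol = 2, labels = "A")
```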


5.1.3 Small changes can be made with align and font()

If y-axis lines or x-axis lines are not aligned, this can be fixed with a call to align = "v" or align="h". Note that this will align the edges of the plot object, not the panels that represent data alone.

To make sure all axis titles are the same size (B and C look a little off to me), we can use font() to specify which text we want changed and the size we want to change it to. I am also going to make the legend title size the same.

Let's look at the font() function, which is actually part of the ggpubr package. You'll see that we can treat it much like adding a layer to our plots as we use the + operator. It acts like a wrapper to directly alter the ggplot object through underlying calls to the theme layer. Although limited, there are a number of elements that it can affect, including fonts for:

More about the font() function There are a few more basic elements you can alter through this function and you can find out more at the rdrr.io website.

Let's try out the font() function now and save the result into a new variable multiplot. You'll notice it's still not quite perfect but better in many places.


5.1.4 Save your multipanel plots with ggsave()

The ggarrange objects, while structurally different from ggplot objects, inherit much of their information and behaviours from the ggplot class. Therefore, you can use other ggplot functions like ggsave() to write your plots to file. The calls follow the same format as previous examples we've used, so let's give it a try.


5.2.0 Challenge: Make a combined image with a panel across the top

Using the code available here for a density plot we made earlier in the lesson, make a combined figure. We are going to put our density plot across the top row, the histogram in the bottom left corner, and the box plot in the bottom right corner. Make sure the legend and axis titles are the same size. Change the legend text for the line graph to be smaller than the legend title.



5.3.0 Upset plots summarize your data

5.3.1 Venn diagrams, while popular, can be difficult to interpret

UpSetR: https://github.com/hms-dbmi/UpSetR

Upset plots are an alternative to Venn diagrams that show the intersections of sets as well as the sizes of the sets. Venn diagrams can also be difficult to interpret for more than 2 or 3 sets. This is a real-life figure from BMC Bioinformatics. Sure, it looks pretty, but what does the number 24 represent in this picture in terms of A, B, C, D, and E?

Fig2_Lam_BMCBioinformatics2016_17.png


5.3.2 Working with UpsetR to visualize overlapping datasets

Let's see how UpSet plots work practically.

Question: For 8 Taxa, how many sites are those Taxa present at, and what is the overlap between sites with other Taxa?

I have picked 8 different Taxa, which would make a crazy Venn diagram with a lot of zeros in it, since a couple of these Taxa are rare.


The data that we have is an OTU table. It is 8 rows (as expected) by 82 columns (81 sample entries).

For this purpose we simply want to know whether a given Taxa was present or absent in the sample, and then we can form intersections based on this information. So, for each OTU value that is not 0, we replace it with a 1 instead.


5.3.3 Conditional statements implement branched code

In our next step, as we replace all non-zero values with a 1, we use our good friend apply() together with a conditional statement, ifelse(). The ifelse() function lets us test a condition and assign one value when the answer is TRUE and another when it is FALSE.
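The idea in miniature, with a toy OTU table standing in for our data:

```r
# Toy OTU table: taxa in rows, samples in columns.
otu <- data.frame(sample1 = c(0, 5, 2),
                  sample2 = c(3, 0, 7))

# ifelse() is vectorized: for each value, return 1 if the test is TRUE
# and 0 if it is FALSE. apply() runs this over every column.
presence <- as.data.frame(apply(otu, 2, function(x) ifelse(x > 0, 1, 0)))
presence
#   sample1 sample2
# 1       0       1
# 2       1       0
# 3       1       1
```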


5.3.4 Use the upset() function to generate an UpSet plot

Now that we've properly formatted our table subtax, it has 81 rows (samples) by 9 columns (8 taxa that we are investigating)

To use the upset() plotting function, we enter our data set, the number of sets we are inputting, if we want to order the results (in this case by frequency), and how many intersections we want to show. Here, I will show 15 intersections - we know the remaining intersections would be zero since this is ordered by frequency.

Watch out! Remember we said that the tibble and data frame were interchangeable for most cases? When we venture outside the tidyverse we may not be afforded the same courtesy. In the case of the UpSetR package, it much prefers to work with data.frames instead of tibble objects.
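The call itself can be sketched like this; a toy 0/1 table stands in for subtax, and note the explicit coercion to a data.frame:

```r
library(UpSetR)

set.seed(1)  # reproducible toy data
subtax <- data.frame(TaxonA = sample(0:1, 20, replace = TRUE),
                     TaxonB = sample(0:1, 20, replace = TRUE),
                     TaxonC = sample(0:1, 20, replace = TRUE))

upset(as.data.frame(subtax),
      nsets       = 3,       # sets to include (8 in our lecture example)
      order.by    = "freq",  # sort intersections by frequency
      nintersects = 15)      # cap the number of intersections shown
```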

5.3.5 Understanding our Upset plot

Our plot can be broken into 3 sections.

  1. The left-hand barplot denotes the number of observations in each set/category.
  2. The bottom plot graphically represents the different combinations of each category up to nintersects.
  3. The upper barplot displays the number of occurrences for the combination displayed in the bottom plot.

From our result, our greatest intersection size is 28 instances. This means that 28 of our 81 samples have Clostridia, Alphaproteobacteria, and Deinococci present in the same sample WITHOUT Chloroflexi, both Acidobacteria, Cyanobacteria and Lentisphaeria.

We can see from the Set Size that Clostridia and Alphaproteobacteria are present in almost all samples, and Deinococci is present in greater than 3/4 of the samples. You might expect this overlap to be higher, but remember that this is without all the other Taxa.

Moving along our intersection sizes and dot matrix, we can see that other intersections include these bacteria. For example, the next bar to the right, with an intersection size of 16, also includes Lentisphaeria. Some quick mental math shows that 63 samples have these top 3 Taxa present.


5.4.0 But wait, there's more!

While we have just scratched the surface of ggplot2, as mentioned earlier in lecture there are many additional visualization packages that can work with more specific types of data. In some cases, these packages add functionality to ggplot2 itself!

5.4.1 Interactive graphics

Plotly: https://plot.ly/r/

ggvis: http://ggvis.rstudio.com/interactivity.html

Heatmaps: https://github.com/talgalili/heatmaply

Interactive time-series data: https://rstudio.github.io/dygraphs/

5.4.2 Network diagrams

visNetwork (based on igraph): https://datastorm-open.github.io/visNetwork/edges.html

5.4.3 Circos Plots

Migest: https://gjabel.wordpress.com/category/r/migest/

5.4.4 Geospatial Data

Static Maps:

Interactive Maps:

5.4.5 Phylogenetics Data

ggtree:

treeman:

metacoder:

phyloseq:

5.4.6 Genomics Data

ggbio:

GenVisR:

GenomeGraphs:

5.5.0 Just for fun...


Challenge

Install and load the gapminder package. Save the gapminder data into an object using latrines <- gapminder. Plot the following:

  1. a qualitative variable with a quantitative variable (at least 2 ways).
    • Facet your plot by the qualitative variable.
  2. two quantitative variables (at least 2 ways).
    • Add a regression line to your plot.

6.0.0 Class summary

That's the end for our fourth class on R! We took a break from data wrangling this week to focus on the basics of data visualization including:

  1. Understanding the grammar of graphics philosophy.
  2. Producing basic plots: scatterplot, barplots, and boxplots.
  3. Customizing plot elements and themes
  4. Saving your plots to file
  5. Arranging multiple plots into a single canvas.
  6. Packages outside of ggplot.

6.1.0 Post-lecture assessment (12% of final grade)

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete all chapters from the Introduction to Data Visualization with ggplot2 course, which has a total of 4 chapters and 4,300 points. This is a pass-fail assignment, and in order to pass you need to achieve at least 3,225 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the entire course summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course and you'll want to expand each section. It should look something like this:

DataCampIntroR.jpg

You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment section of Quercus. By submitting your scores for each section, and chapter, we can keep track of your progress, identify knowledge gaps, and produce a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, October 14th to submit your assignment (right before the next lecture).


6.2.0 Acknowledgements

Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared in Jupyter Notebook by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


6.3.0 Your DataCamp academic subscription

This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.

Your DataCamp academic subscription grants you free access to the DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.

DataCampLogo.png


6.4.0 Resources

Wickham, Hadley. (2010). A Layered Grammar of Graphics. Journal of Computational and Statistical Graphics.
Wilkinson, L. (2005), The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer. [14, 18]
Tufte, Edward R. The Visual Display of Quantitative Information.

http://www.cookbook-r.com/Graphs/
https://github.com/jennybc/ggplot2-tutorial
http://stcorp.nl/R_course/tutorial_ggplot2.html
http://ggplot2.tidyverse.org/reference/theme.html
http://joeystanley.com/blog/custom-themes-in-ggplot2
https://github.com/jrnold/ggthemes
https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf
https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/
https://github.com/eclarke/ggbeeswarm
https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html
http://www.sthda.com/english/rpkgs/ggpubr/
https://rpubs.com/drsong/9575
http://elpub.bib.uni-wuppertal.de/edocs/dokumente/fbb/wirtschaftswissenschaft/sdp/sdp15/sdp15006.pdf
http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/81-ggplot2-easy-way-to-mix-multiple-graphs-on-the-same-page/
https://rdpeng.github.io/Biostat776/notes/pdf/grdevices.pdf


7.0.0 Appendix I: Even more plots

Here I've compiled some additional examples of plots using ggplot to help get you started on working with this beloved package.

7.1.0 Circular plots

In ggplot2, circular plots are related to bar graphs - they just have different coordinate systems. The default coordinate system is Cartesian, and we need to switch to polar coordinates to make a circle.

This circular barchart is known by a few names including the Nightingale plot, rose chart, and Coxcomb plot. Looking at the plot, pH levels increase as you move up the outer rings, and depth increases as you move clockwise around the circle. Colour is still represented by country.
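Sketching the idea with the built-in mpg data:

```r
library(ggplot2)

# The same bar chart, wrapped into polar coordinates, becomes a
# rose/Coxcomb chart.
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  coord_polar()
```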


7.2.0 Classic pie chart

Since we're on the topic of polar coordinates, to make your classic pie chart, use theta to specify what variable is going to be used to make up the angles (width of pie slices). To wrap to a full circle instead of having sections (as in the above Coxcomb plot), the width is set to one.


7.3.0 Lines

We briefly discussed line graphs in Lecture 03. Let's revisit how to make this visualization. To draw a line graph, we select geom_line().

We can colour the lines for the different countries. Looking at our previous graph, what must have happened when these values were over the same range?


7.4.0 Error bars

To plot error bars, we need to give the geom, geom_errorbar(), and the interval over which we want to draw the bar. This interval is usually one standard deviation, one standard error or a confidence interval.

Luckily, we have learned how to calculate the mean and standard deviation with dplyr. Once we have that, all we need to do is add or subtract the standard deviation from our data points to get the bounds of the error bar.

geom_errorbar() takes these bounds passed to the parameters ymax and ymin. In this case, since there were no sample replicates the standard deviation is taken from the mean of all of the soil readings in a given country at a given depth.
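A sketch of the pattern, using dplyr summaries of the built-in mpg data:

```r
library(ggplot2)
library(dplyr)

# Mean and standard deviation per group, then mean +/- sd as bar bounds.
summary_df <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy))

ggplot(summary_df, aes(x = class, y = mean_hwy)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_hwy - sd_hwy,
                    ymax = mean_hwy + sd_hwy),
                width = 0.2)
```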


7.5.0 Strip plots: boxplots without the boxes.

The first plots we made were scatterplots with 2 continuous variables. With one continuous variable and one categorical variable, we can make a stripplot. We use the point geom for both of these types of plots.


Again, we have a lot of values crushed near the x-axis. If we add log scaling to the y-axis, we get an error that this transformation created infinite values. This is because we have zeros in our data set.

Again we get a warning that has to do with the conversion of our values to log10(OTU), which produces infinite values for OTU == 0. Knowing this, we could ignore the warning, or add +1 to each OTU (negligible on a log scale).


In order to see the points a little better, we can 'jitter' them. Jittering spreads out our data points while keeping them in the same area of the plot so we can get an idea of density.


7.6.0 Bubble Plots can visualize differences by point size

With bubble plots, we need to provide the size of our bubbles in a way that is proportional to our data, and provide a scale to map these to.

Let's group our OTUs by Country and Taxa and look at the Taxa for the top 30 OTUs. As a refresher, try to do this using dplyr functions.

We are going to make our bubbles (which can be thought of as large points) with geom_jitter() so that our bubbles don't overlap. Remember that we want to specify what we are plotting with aesthetics. We want the bubbles representing OTUs_per_Taxa, so we divide each value by pi when calling size.


However, it is hard to tell proportionally how different these sizes are. Mapping values to a scale gives us more of an idea of their relative sizes. This is done by giving a range of values to scale_size_continuous(). This range is the size you want your plotting symbols (bubbles) to be after transformation. You can use trial and error to see what range you like; it is not changing your data, but rather your perception of the data.
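For example (mpg data standing in for our bubble plot):

```r
library(ggplot2)

# range gives the smallest and largest plotted point sizes; it changes
# the display, not the underlying data.
ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point(alpha = 0.5) +
  scale_size_continuous(range = c(1, 10))
```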


7.7.0 Labelling data points

ggplot2 provides 2 methods of labelling: geom_label() and geom_text(). Let's look at the help menu to find out what the difference between them is.

We have to specify with aesthetics what we wish to plot for our label.


This isn't fantastic because our labels are overlapping. What parameters might you be able to use to move the labels? Try using them to get the labels to not overlap.

The text geom has an option to check for overlap of labels. geom_text() does not provide a background for the label, and it may be harder to tell which label belongs to each data point.

What happened here? Why don't we have as many labels? Look in the help menu to explain this behaviour.

I don't like futzing with label positions, so I went looking for a package that would do this for me. ggrepel will 'repel' your labels away from each other without getting rid of them. Let's check it out with our bubble plot labels. Install and load ggrepel. The equivalent function we can use is geom_label_repel(). The force parameter allows you to modify how far you want your labels pushed away from each other.

Here, the box colour is being used to map to our data points but ggrepel can instead connect a line to data points by altering the value of segment.size. Variations on use can be found here: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html.
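A sketch of geom_label_repel() on built-in data, with labels taken from mtcars row names:

```r
library(ggplot2)
library(ggrepel)

top_cars <- mtcars[1:10, ]
top_cars$name <- rownames(top_cars)

# force pushes overlapping labels apart; segment.size controls the
# width of the line connecting each label to its point.
ggplot(top_cars, aes(x = wt, y = mpg, label = name)) +
  geom_point() +
  geom_label_repel(force = 2, segment.size = 0.3)
```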

CAGEF_new.png